Wav2vec2.0-large is a speech model pre-trained on the audio data from LibriVox (LV-60k) [5] in a self-supervised manner [6]. In this work, we use the Wav2vec2.0-large model. The hidden dimension, inner dimension, and number of attention heads in each transformer block are 1024, 4096, and 16, respectively. The pre-trained model is fine-tuned on LibriSpeech's 100-hour clean subset using the standard Connectionist Temporal Classification (CTC) loss. We follow the implementation and settings from HuggingFace Transformers [7] for the fine-tuning.
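The architecture described above can be sketched with the HuggingFace Transformers API. This is a minimal illustration, not the paper's training script: the configuration fields mirror the stated dimensions (hidden 1024, inner 4096, 16 heads), while the layer count and CTC vocabulary size are assumptions based on the standard large variant.

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

# Dimensions stated in the text; num_hidden_layers and vocab_size are
# assumptions (the standard large variant stacks 24 transformer blocks,
# and a character-level CTC vocabulary is typical).
config = Wav2Vec2Config(
    hidden_size=1024,        # hidden dimension of each transformer block
    intermediate_size=4096,  # inner (feed-forward) dimension
    num_attention_heads=16,  # attention heads per block
    num_hidden_layers=24,    # assumed: 24 blocks in the large variant
    vocab_size=32,           # assumed: character vocabulary + CTC blank
)

# Randomly initialised model with this architecture; in practice one would
# load the LV-60k pre-trained checkpoint and fine-tune it with CTC loss
# following the HuggingFace fine-tuning examples.
model = Wav2Vec2ForCTC(config)
```

In practice the pre-trained weights would be loaded with `Wav2Vec2ForCTC.from_pretrained(...)` before fine-tuning on the 100-hour clean subset.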
Technology: